We can add the concept of “momentum” to vanilla gradient descent by including a scaled term corresponding to the change in the parameters from the last iteration:

$$\theta_{t+1} = \theta_t - \alpha \, \nabla_\theta L(\theta_t) + \beta \left( \theta_t - \theta_{t-1} \right),$$

or equivalently, writing $v_t = \theta_t - \theta_{t-1}$ for the velocity,

$$v_{t+1} = \beta v_t - \alpha \, \nabla_\theta L(\theta_t), \qquad \theta_{t+1} = \theta_t + v_{t+1}.$$

Notice that the velocity vector starts at zero: $v_0 = 0$. Hence the first update is “slower” than it would otherwise have been (by a factor of $1 - \beta$): if the gradient were a constant $g$, the velocity would settle at $-\alpha g / (1 - \beta)$, whereas the first step is only $v_1 = -\alpha g$. Technically, this slowness persists throughout training, decaying exponentially over the subsequent updates, since under the same constant-gradient assumption $v_t = -\alpha g \, (1 - \beta^t)/(1 - \beta)$. Of course, the initial zero momentum is completely negligible after a small number of updates.
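
As a concrete illustration, here is a minimal NumPy sketch of this update rule; the function and parameter names (`gd_momentum`, `grad_fn`, `lr`, `beta`) are illustrative choices, not from any particular library.

```python
import numpy as np

def gd_momentum(grad_fn, theta0, lr=0.1, beta=0.9, n_steps=200):
    """Gradient descent with (heavy-ball) momentum:
    v_{t+1} = beta * v_t - lr * grad(theta_t);  theta_{t+1} = theta_t + v_{t+1}.
    """
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)  # v_0 = 0, so the first step is just -lr * grad:
                              # the factor-of-(1 - beta) slowdown discussed above
    for _ in range(n_steps):
        v = beta * v - lr * grad_fn(theta)  # exponentially decaying average of past steps
        theta = theta + v
    return theta

# Quick check on f(x) = x^2, whose minimum is at 0.
grad = lambda x: 2 * x
print(gd_momentum(grad, np.array([5.0])))  # -> approximately [0.]
```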